Abstract


We present a new algorithm for community detection. The algorithm
uses random walks to embed the graph in a space of measures, after
which a modification of k-means in that space is applied. The algorithm
is therefore fast and easily parallelizable. We evaluate the algorithm on
standard random graph benchmarks, including some overlapping community
benchmarks, and find its performance to be better than or at least as good
as that of previously known algorithms. We also prove a linear time (in the
number of edges) guarantee for the algorithm on a $p,q$-stochastic block model
where $p \ge c \cdot N^{-1/2+\epsilon}$ and $p - q \ge c' \sqrt{p N^{-1/2+\epsilon} \log N}$.

1 Introduction 

Community detection in graphs, also known as graph clustering, is a problem 
where one wishes to identify subsets of the vertices of a graph such that the 
connectivity inside the subset is in some way denser than the connectivity of the 
subset with the rest of the graph. Such subsets are referred to as communities, 
and it often happens in applications that if two vertices belong to the same 
community, they have similar application-related qualities. This in turn may 
allow for a higher level analysis of the graph, in terms of communities instead 
of individual nodes. Community detection finds applications in a diversity of 
fields, such as social networks analysis, communication and traffic design, in 
biological networks, and, generally, in most fields where meaningful graphs can 
arise (see, for instance, [1] for a survey). In addition to direct applications
to graphs, community detection can, for instance, be also applied to general 
Euclidean space clustering problems, by transforming the metric to a weighted 
graph structure (see [2] for a survey). 

Community detection problems come in different flavours, depending on 
whether the graph in question is simple, or weighted, and/or directed.
Another important distinction is whether the communities are allowed to overlap
or not. In the overlapping communities case, each vertex can belong to several 
subsets. 
A difficulty with community detection is that the notion of community is
not well defined. Different algorithms may employ different formal notions of 
a community, and can sometimes produce different results. Nevertheless, there 
exist several widely adopted benchmarks - synthetic models and real-life graphs 
- where the ground truth communities are known, and algorithms are evaluated 
based on the similarity of the produced output to the ground truth, and based on 
the amount of required computations. On the theoretical side, most of the effort 
is concentrated on developing algorithms with guaranteed recovery of clusters 
for graphs generated from variants of the Stochastic Block Model (referred to
as SBM in what follows).

In this paper we present a new algorithm, DER (Diffusion Entropy Reducer, 
for reasons to be clarified later), for non-overlapping community detection. The 
algorithm is an adaptation of the k-means algorithm to a space of measures 
which are generated by short random walks from the nodes of the graph. The 
adaptation is done by introducing a certain natural cost on the space of the 
measures. As detailed below, we evaluate DER on several benchmarks and
find its performance to be as good as or better than the best alternative method.
In addition, we establish some theoretical guarantees on its performance. While 
the main purpose of the theoretical analysis in this paper is to provide some 
insight into why DER works, our result is also one of the very few results in the 
literature that show reconstruction in linear time. 

On the empirical side, we first evaluate our algorithm on a set of random 
graph benchmarks known as the LFR models, [3]. In [4], 12 other algorithms 
were evaluated on these benchmarks, and three algorithms, described in [5], [6]
and [7], were identified that exhibited significantly better performance than the
others, and similar performance among themselves. We evaluate our algorithm 
on random graphs with the same parameters as those used in [3] and find its 
performance to be as good as these three best methods. Several well known 
methods, including spectral clustering [8], exhaustive modularity optimization 
(see [3] for details), and clique percolation [5], have worse performance on the 
above benchmarks. 

Next, while our algorithm is designed for non-overlapping communities, we 
introduce a simple modification that enables it to detect overlapping
communities in some cases. Using this modification, we compare the performance of
our algorithm to the performance of 4 overlapping community algorithms on a
set of benchmarks that were considered in [10]. We find that in all cases DER
performs better than all 4 algorithms. None of the algorithms evaluated in [4]
and [3] has theoretical guarantees.

On the theoretical side, we show that DER reconstructs with high probability
the partition of the $p,q$-stochastic block model such that, roughly, $p \ge N^{-1/2}$,
where $N$ is the number of vertices, and $p - q \ge c \sqrt{p N^{-1/2} \log N}$ (this holds
in particular when $p/q \ge c' > 1$) for some constant $c > 0$. We show that for
this reconstruction only one iteration of the k-means is sufficient. In fact, three
passes over the set of edges suffice. While the cost function we introduce
for DER will appear at first to have purely probabilistic motivation, for the
purposes of the proof we provide an alternative interpretation of this cost in
terms of the graph, and the arguments show which properties of the graph are
useful for the convergence of the algorithm.

Finally, although this is not the emphasis of the present paper, it is worth 
noting here that, as will be evident later, our algorithm can be trivially
parallelized. This seems to be a particularly nice feature since most other algorithms,
including spectral clustering, are not easy to parallelize and do not seem to
have parallel implementations at present.

The rest of the paper is organized as follows: Section 2 overviews related work
and discusses relations to our results. In Section 3 we provide the motivation
for the definition of the algorithm, derive the cost function and establish some
basic properties. In Section 4 we present the results of the empirical evaluation
of the algorithm and Section 5 describes the theoretical guarantees and the
general proof scheme. Some proofs and additional material are provided in the 
supplementary material. 


2 Literature review 

Community detection in graphs has been an active research topic for the last two 
decades and generated a huge literature. We refer to [1] for an extensive survey. 
Throughout the paper, let $G = (V, E)$ be a graph, and let $P = P_1, \ldots, P_k$ be a
partition of $V$. Loosely speaking, a partition $P$ is a good community structure
on $G$ if for each $P_i \in P$, more edges stay within $P_i$ than leave $P_i$. This is
usually quantified via some cost function that assigns larger scalars to partitions
$P$ that are in some sense better separated. Perhaps the most well known cost
function is the modularity, which was introduced in m and served as a basis of
a large number of community detection algorithms ([1]). The popular spectral
clustering methods, i; i, can also be viewed as a (relaxed) optimization of a
certain cost (see [2]).

Yet another group of algorithms is based on fitting a generative model of a 
graph with communities to a given graph. References ffU; cni are two among
the many examples. Perhaps the simplest generative model for non-overlapping
communities is the stochastic block model, which we now define. Let
$P = P_1, \ldots, P_k$ be a partition of $V$ into $k$ subsets. The $p,q$-SBM is a distribution
over the graphs on vertex set $V$, such that all edges are independent and, for
$i, j \in V$, the edge $(i, j)$ exists with probability $p$ if $i, j$ belong to the same $P_s$,
and it exists with probability $q$ otherwise. If $q \ll p$, the components $P_i$ will
be well separated in this model. We denote the number of nodes by $N = |V|$
throughout the paper.
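
To make the definition concrete, the following minimal sketch samples a graph from a $p,q$-SBM. The function name `sample_sbm`, the encoding of the partition as a list of block indices, and the parameter values are illustrative assumptions, not part of the paper.

```python
import random

def sample_sbm(blocks, p, q, seed=0):
    """Sample an undirected graph from a p,q-SBM.

    blocks[i] is the index of the subset P_s containing node i
    (a hypothetical encoding of the partition). Returns the set of
    edges (i, j) with i < j.
    """
    rng = random.Random(seed)
    n = len(blocks)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Edges are independent: probability p inside a block, q across.
            prob = p if blocks[i] == blocks[j] else q
            if rng.random() < prob:
                edges.add((i, j))
    return edges

# Two equal-sized blocks of 50 nodes each (illustrative values).
blocks = [0] * 50 + [1] * 50
edges = sample_sbm(blocks, p=0.5, q=0.05)
inside = sum(1 for i, j in edges if blocks[i] == blocks[j])
across = len(edges) - inside
```

This loop visits all vertex pairs and is meant only to illustrate the model; it is not an efficient sampler for large sparse graphs.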

Graphs generated from SBMs can serve as a benchmark for community
detection algorithms. However, such graphs lack certain desirable properties, such
as power-law degree and community size distributions. Some of these issues
were fixed in the benchmark models in 0; m, and these models are referred
to as LFR models in the literature. More details on these models are given in
Section 4.
We now turn to the discussion of the theoretical guarantees. Typically results
in this direction provide algorithms that can reconstruct, with high probability,
the ground partition of a graph drawn from a variant of a $p,q$-SBM model,
with some, possibly large, number of components $k$. Recent results include the
works m and m. In this paper, however, we shall only analyse analytically
the $k = 2$ case, and such that, in addition, $|P_1| = |P_2|$.

For this case, the best known reconstruction result was obtained already in 
m and was only improved in terms of runtime since then. Namely, Boppana’s
result states that if p > ci and p — q> C 2 , then with high probability 
the partition is reconstructible. A similar bound can be obtained, for instance,
from the approaches in HU; HU, to name a few. The methods in this group 
are generally based on the analysis of the spectrum of the adjacency matrix. 
The run time of these algorithms is non-linear in the size of the graph and 
it is not known how these algorithms behave on graphs not generated by the 
probabilistic models that they assume. 

It is generally known that when the graphs are dense (p of order of constant), 
simple linear time reconstruction algorithms exist (see [18]). The first, and to
the best of our knowledge, the only previous linear time algorithm for non-dense
graphs was proposed in [18]. This algorithm works for $p \ge C_3(\epsilon) N^{-1/2+\epsilon}$ for any
fixed $\epsilon > 0$. The approach of [18] was further extended in Hl]i to handle
more general cluster sizes. These approaches differ significantly
from the spectrum based methods, and provide equally important theoretical
insight. However, their empirical behaviour was never studied, and it is likely
that even for graphs generated from the SBM, extremely high values of $N$
would be required for the algorithms to work, due to large constants in the
concentration inequalities (see the concluding remarks in HS]).


3 Algorithm 


Let $G$ be a finite undirected graph with a vertex set $V = \{1, \ldots, n\}$. Denote
by $A = \{a_{ij}\}$ the symmetric adjacency matrix of $G$, where $a_{ij} \ge 0$ are edge
weights, and for a vertex $i \in V$, set $d_i = \sum_j a_{ij}$ to be the degree of $i$. Let $D$
be an $n \times n$ diagonal matrix such that $D_{ii} = d_i$, and set $T = D^{-1}A$ to be the
transition matrix of the random walk on $G$. Set also $p_{ij} = T_{ij}$. Finally, denote
by $\pi$, $\pi(i) = d_i / \sum_j d_j$, the stationary measure of the random walk.
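
As a small illustration of this notation, the sketch below builds the transition matrix and the degree-proportional measure for a toy graph and applies one step of the walk to that measure. The helper names and the three-node path graph are assumptions chosen for illustration, and every vertex is assumed to have positive degree.

```python
def transition_matrix(A):
    """Row-normalize a symmetric weight matrix A: T = D^{-1} A."""
    degrees = [sum(row) for row in A]
    T = [[a / d for a in row] for row, d in zip(A, degrees)]
    return T, degrees

def stationary(degrees):
    """pi(i) = d_i / sum_j d_j."""
    total = sum(degrees)
    return [d / total for d in degrees]

# Toy path graph 0 - 1 - 2 (not from the paper).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
T, d = transition_matrix(A)
pi = stationary(d)
# One step of the walk started from pi: (pi T)(j) = sum_i pi(i) T[i][j].
pi_next = [sum(pi[i] * T[i][j] for i in range(len(A))) for j in range(len(A))]
```

Comparing `pi_next` with `pi` checks numerically that the degree-proportional measure is invariant under one step of the walk.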


A number of community detection algorithms are based on the intuition 
that distinct communities should be relatively closed under the random walk 
(see m), and employ different notions of closedness. Our approach also takes 
this point of view. 

For a fixed $L \in \mathbb{N}$, consider the following sampling process on the graph:
Choose vertex $v_0$ randomly from $\pi$, and perform $L$ steps of a random walk on
$G$, starting from $v_0$. This results in a length $L + 1$ sequence of vertices, $x^1$.
Repeat the process $N$ times independently, to obtain also $x^2, \ldots, x^N$.

Suppose now that we would like to model the sequences $x^s$ as a multinomial
mixture model with a single component. Since each coordinate $x^s_t$ is distributed
according to $\pi$, the single component of the mixture should be $\pi$ itself, when
$N$ grows. Now suppose that we would like to model the same sequences with
a mixture of two components. Because the sequences are sampled from a random
walk rather than independently from each other, the components need no
longer be $\pi$ itself, as in any mixture where some elements appear more often
together than others. The mixture as above can be found using the EM
algorithm, and this in principle summarizes our approach. The only additional
step, as discussed above, is to replace the sampled random walks with their true
distributions, which simplifies the analysis and also leads to somewhat improved
empirical performance.

We now present the DER algorithm for detecting non-overlapping
communities. Its input is the number of components to detect, $k$, the length of the
walks $L$, and an initialization partition $P = \{P_1, \ldots, P_k\}$ of $V$ into disjoint subsets.
$P$ would usually be taken to be a random partition of $V$ into equally sized
subsets.

For $t = 0, 1, \ldots$ and a vertex $i \in V$, denote by $w_i^t$ the $i$-th row of the matrix
$T^t$. Then $w_i^t$ is the distribution of the random walk on $G$, started at $i$, after $t$
steps. Set $w_i = \frac{1}{L}(w_i^1 + \ldots + w_i^L)$, which is the distribution corresponding to
the average of the empirical measures of sequences $x$ that start at $i$.
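
The averaged walk measures can be computed by repeatedly advancing the walk by one step. The sketch below does this with dense lists for readability; the function name `walk_measures` is hypothetical, and the averaging over steps $1, \ldots, L$ is an assumption of this sketch. A sparse implementation would be needed to match the running times discussed in the paper.

```python
def walk_measures(A, L):
    """Return W where row W[i] is (1/L) * (w_i^1 + ... + w_i^L).

    A is a dense symmetric weight matrix (list of lists) in which every
    vertex has positive degree.
    """
    n = len(A)
    T = [[a / sum(row) for a in row] for row in A]
    power = [row[:] for row in T]            # rows of T^1
    W = [[x / L for x in row] for row in T]
    for _ in range(L - 1):
        # Advance every walk by one step: power <- power * T.
        power = [[sum(power[i][m] * T[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                W[i][j] += power[i][j] / L
    return W

# Toy path graph 0 - 1 - 2 (not from the paper), walks of length 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
W = walk_measures(A, L=2)
```

Each row of `W` is a probability measure on the vertices, which is the embedding of the corresponding node into the space of measures.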

For two probability measures $\nu, \mu$ on $V$, set

$D(\nu, \mu) = \sum_{i \in V} \nu(i) \log \mu(i)$.

Although $D$ is not a metric, it will act as a distance function in our algorithm.
Note that if $\nu$ was an empirical measure, then, up to a constant, $D$ would be
just the log-likelihood of observing $\nu$ from independent samples of $\mu$.

For a subset $S \subset V$, set $\pi_S$ to be the restriction of the measure $\pi$ to $S$, and
also set $d_S = \sum_{i \in S} d_i$ to be the full degree of $S$. Let

$\mu_S = \frac{1}{d_S} \sum_{i \in S} d_i w_i$    (1)

denote the distribution of the random walk started from $\pi_S$.
The complete DER algorithm is described in Algorithm 1.
The algorithm is essentially a k-means algorithm in a non-Euclidean space,
where the points are the measures $w_i$, each occurring with multiplicity $d_i$. Step
(1) is the “means” step, and (2) is the maximization step.

Let

$C = \sum_{l=1}^{k} \sum_{i \in P_l} d_i \cdot D(w_i, \mu_l)$    (2)

be the associated cost. As with the usual k-means, we have the following

Lemma 3.1. Either $P$ is unchanged by steps (1) and (2) or both steps (1) and
(2) strictly increase the value of $C$.
Algorithm 1 DER

1: Input: Graph $G$, walk length $L$, number of components $k$.
2: Compute the measures $w_i$.
3: Initialize $P_1, \ldots, P_k$ to be a random partition such that
   $|P_i| = |V|/k$ for all $i$.
4: repeat
5:   (1) For all $s \le k$, construct $\mu_s = \mu_{P_s}$.
6:   (2) For all $s \le k$, set
         $P_s = \{ i \in V \mid s = \arg\max_l D(w_i, \mu_l) \}$.
7: until the sets $P_s$ do not change


The proof is by direct computation and is deferred to the supplementary 
material. Since the number of configurations P is finite, it follows that DER 
always terminates and provides a “local maximum” of the cost C. 
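
The following is a minimal, unoptimized sketch of the iteration in Algorithm 1, written with dense lists for readability. The function name `der`, the tie-breaking rule (lowest index wins), the handling of empty clusters, and the `max_iter` guard are assumptions of this sketch and not part of the paper's description.

```python
import math
import random

def der(A, k, L=1, seed=0, max_iter=100):
    """Sketch of DER for a dense symmetric weight matrix A (list of lists).

    Returns a list of community labels in range(k). Assumes every vertex
    has positive degree. Dense matrices are used for brevity only.
    """
    n = len(A)
    d = [sum(row) for row in A]
    T = [[a / d[i] for a in A[i]] for i in range(n)]
    # w_i = (1/L) * (w_i^1 + ... + w_i^L), stored as row i of W.
    power = [row[:] for row in T]
    W = [[x / L for x in row] for row in T]
    for _ in range(L - 1):
        power = [[sum(power[i][m] * T[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
        W = [[W[i][j] + power[i][j] / L for j in range(n)] for i in range(n)]

    def D(nu, mu):
        # D(nu, mu) = sum_j nu(j) log mu(j), with 0 * log(0) treated as 0.
        total = 0.0
        for a, b in zip(nu, mu):
            if a > 0:
                if b == 0:
                    return -math.inf
                total += a * math.log(b)
        return total

    # Random initial partition into (nearly) equal-sized parts.
    order = list(range(n))
    random.Random(seed).shuffle(order)
    labels = [0] * n
    for position, i in enumerate(order):
        labels[i] = position % k

    for _ in range(max_iter):
        # Step (1): mu_s is the degree-weighted mean of the w_i in P_s.
        mus = []
        for s in range(k):
            members = [i for i in range(n) if labels[i] == s]
            d_s = sum(d[i] for i in members)
            mus.append([sum(d[i] * W[i][j] for i in members) / d_s if d_s else 0.0
                        for j in range(n)])
        # Step (2): move each vertex to the measure maximizing D(w_i, mu_s).
        new_labels = [max(range(k), key=lambda s: D(W[i], mus[s]))
                      for i in range(n)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Toy input (not from the paper): two disjoint triangles.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
labels = der(A, k=2)
```

On this toy input the two triangles end up in different clusters. In practice one would store $A$ sparsely and, as discussed below, combine several runs with different seeds.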

The cost C can be rewritten in a somewhat more informative form. To do so, 
we introduce some notation first. Let $X$ be a random variable on $V$, distributed
according to the measure $\pi$. Let $Y$ be a step of a random walk started at $X$, so that
the distribution of $Y$ given $X = i$ is $w_i$. Finally, for a partition $P$, let $Z$ be the
indicator variable of a partition, $Z = s$ iff $X \in P_s$. With this notation, one can
write

$C = -d_V \cdot H(Y|Z) = d_V \left( -H(Y) + H(Z) - H(Z|Y) \right)$,    (3)


where $H$ are the full and conditional Shannon entropies. Therefore, the DER
algorithm can be interpreted as seeking a partition that maximizes the information
between the current known state ($Z$) and the next step from it ($Y$). This
interpretation gives rise to the name of the algorithm, DER, since every iteration
reduces the entropy $H(Y|Z)$ of the random walk, or diffusion, with respect to
the partition. The second equality in (3) has another interesting interpretation.
Suppose, for simplicity, that $k = 2$, with partition $P_1, P_2$. In general, a clustering
algorithm aims to minimize the cut, the number of edges between $P_1$ and
$P_2$. However, minimizing the number of edges directly will lead to situations
where $P_1$ is a single node, connected with one edge to the rest of the graph in
$P_2$. To avoid such a situation, a relative, normalized version of a cut needs to
be introduced, which takes into account the sizes of $P_1, P_2$. Every clustering
algorithm has a way to resolve this issue, implicitly or explicitly. For DER, this
is shown in the second equality of (3). $H(Z)$ is maximized when the components
are of equal sizes (with respect to $\pi$), while $H(Z|Y)$ is minimized when the
measures $\mu_{P_s}$ are as disjointly supported as possible.

As with any k-means algorithm, DER’s results depend somewhat on its random
initialization. All k-means-like schemes are usually restarted several times and
the solution with the best cost is chosen. In all cases which we evaluated we
observed empirically that the dependence of DER on the initial parameters is
rather weak. After two or three restarts it usually found a partition nearly as
good as after 100 restarts. For clustering problems, however, there is another
simple way to aggregate the results of multiple runs into a single partition,
which slightly improves the quality of the final results. We use this technique in
all our experiments and we provide the details in the Supplementary Material,
Section A.

[Figure 1: (a) Karate Club; (b) Political Blogs]

We conclude by mentioning two algorithms that use some of the concepts
that we use. The Walktrap, [50], similarly to DER constructs the random walks
(the measures $w_i$, possibly for $L > 1$) as part of its computation. However,
Walktrap uses the $w_i$’s in a completely different way. Both the optimization
procedure and the cost function are different from ours. The Infomap, E, m, has
a cost that is related to the notion of information. It aims to minimize the
information required to transmit a random walk on $G$ through a channel; the
source coding is constructed using the clusters, and the best clusters are those that
yield the best compression. This does not seem to be directly connected to the
maximum likelihood motivated approach that we use. As with Walktrap, the
optimization procedure of Infomap also completely differs from ours.


4 Evaluation 

In this section the results of the evaluation of the DER algorithm are presented. In
Section 4.1 we illustrate DER on two classical graphs. Sections 4.2 and 4.3
contain the evaluation on the LFR benchmarks.

4.1 Basic examples 

When a new clustering algorithm is introduced, it is useful to get a general feel
of it with some simple examples. Figure 1a shows the classical Zachary’s Karate
Club, [22]. This graph has a ground partition into two subsets. The partition
shown in Figure 1a is a partition obtained from a typical run of the DER algorithm,
with $k = 2$ and a wide range of $L$’s ($L \in [1, 10]$ were tested). As is the case with
many other clustering algorithms, the shown partition differs from the ground
partition in one element, node 8 (see m).

Figure 1b shows the political blogs graph, [^. The nodes are political blogs,
and the graph has an (undirected) edge if one of the blogs had a link to the other.
There are 1222 nodes in the graph. The ground truth partition of this graph
has two components - the right wing and left wing blogs. The labeling of the
ground truth was partially automatic and partially manual, and both processes
could introduce some errors. The run of DER reconstructs the ground truth
partition with only 57 nodes misclassified. The NMI (see the next section, Eq.
(4)) to the ground truth partition is 0.74.

The political blogs graph is particularly interesting since it is an example
of a graph for which fitting an SBM model to reconstruct the clusters produces
results very different from the ground truth. It can also be easily checked that
spectral clustering, in the form given in [8], is not close to ground truth when $k = 2$.
It is close to ground truth when $k = 3$, however. To overcome the problem with
SBM fitting on this graph, a degree sensitive version of SBM was introduced in
[23]. That algorithm produces a partition with NMI 0.75.


4.2 LFR benchmarks 


The LFR benchmark model is a widely used extension of the stochastic
block model, where node degrees and community sizes have power law
distributions, as often observed in real graphs. An important parameter of this model is
the mixing parameter $\mu \in [0, 1]$ that controls the fraction of the edges of a node
that go outside the node’s community (or outside all of the node’s communities, in
the overlapping case). For small $\mu$, there will be a small number of edges going
outside the communities, leading to disjoint, easily separable graphs, and the
boundaries between communities will become less pronounced as $\mu$ grows.

Given a set of communities P on a graph, and the ground truth set of 
communities Q, there are several ways to measure how close P is to Q. One 
standard measure is the normalized mutual information (NMI), given by: 


$NMI(P, Q) = \frac{2 I(P, Q)}{H(P) + H(Q)}$,    (4)

where $H$ is the Shannon entropy of a partition and $I$ is the mutual information
(see [1] for details). NMI is equal to 1 if and only if the partitions $P$ and $Q$ coincide,
and it takes values between 0 and 1 otherwise.
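
For non-overlapping partitions given as label lists, the normalized mutual information can be computed directly from counts. The sketch below uses the common definition $2I/(H(P)+H(Q))$; the function name `nmi` and the convention of returning 1 when both partitions are trivial are assumptions of this sketch.

```python
import math
from collections import Counter

def nmi(labels_p, labels_q):
    """Normalized mutual information 2*I(P,Q) / (H(P) + H(Q)).

    labels_p[i] and labels_q[i] are the community labels of node i in the
    two partitions being compared.
    """
    n = len(labels_p)
    count_p, count_q = Counter(labels_p), Counter(labels_q)
    joint = Counter(zip(labels_p, labels_q))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    h_p, h_q = entropy(count_p), entropy(count_q)
    mutual = sum(c / n * math.log(c * n / (count_p[a] * count_q[b]))
                 for (a, b), c in joint.items())
    if h_p + h_q == 0:
        # Both partitions consist of a single set (convention assumed here).
        return 1.0
    return 2 * mutual / (h_p + h_q)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same partition, labels renamed
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions share no information
```

Note that the score depends only on the partitions themselves, not on how the communities are labelled.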

When computed with NMI, the sets inside $P$, $Q$ cannot overlap. To deal with
overlapping communities, an extension of NMI was proposed in [25]. We refer
to the original paper for the definition, as the definition is somewhat lengthy.
This extension, which we denote here as ENMI, was subsequently used in the
literature as a measure of closeness of two sets of communities, even in the cases
of disjoint communities. Note that most papers use the notation NMI while the
metric that they really use is ENMI.

Figure 1c shows the results of the evaluation of DER for four cases: the size of
a graph was either $N = 1000$ or $N = 5000$ nodes, and the size of the communities
was restricted to be either between 10 and 50 (denoted S in the figures) or
between 20 and 100 (denoted B). For each combination of these parameters, $\mu$
varied between 0.1 and 0.8. For each combination of graph size, community size
restrictions as above and $\mu$ value, we generated 20 graphs from that model and
ran DER. To provide some basic intuition about these graphs, we note that the
number of communities in the 1000S graphs is strongly concentrated around 40,
and in 1000B, 5000S, and 5000B graphs it is around 25, 200 and 100 respectively.
Each point in Figure 1c is the average ENMI on the 20 corresponding
graphs, with standard deviation as the error bar. These experiments correspond
precisely to the ones performed in [3] (see Supplementary Material, Section C for
more details). In all runs of DER we have set $L = 5$ and set $k$ to be the true
number of communities for each graph, as was done in [4] for the methods that
required it. Therefore our Figure 1c can be compared directly with Figure 2 in
[4].

[Figure 1, continued: (c) DER, LFR benchmarks; (d) Spectral Alg., LFR benchmarks]

From this comparison we see that DER and two of the best algorithms
identified in [4], Infomap [5] and RN [6], reconstruct the partition perfectly for
$\mu \le 0.5$; for $\mu = 0.6$ DER’s reconstruction scores are between Infomap’s and
RN’s, with values for all of the algorithms above 0.95; and for $\mu = 0.7$ DER has
the best performance in two of the four cases. For $\mu = 0.8$ all algorithms have
score 0.

We have also performed the same experiments with the standard version of
spectral clustering, [8], because this version was not evaluated in [4]. The results
are shown in Fig. 1d. Although the performance is generally good, the scores
are mostly lower than those of DER, Infomap and RN.

4.3 Overlapping LFR benchmarks 

Table 1: Evaluation for Overlapping LFR. All values except DER are from [10]

Alg.         μ = 0    μ = 0.2   μ = 0.4
DER          0.94     0.9       0.83
SVI ([10])   0.89     0.73      0.6
POI ([26])   0.86     0.68      0.55
INF ([n])    0.42     0.38      0.4
COP ([27])   0.65     0.43      0.0

We now describe how DER can be applied to overlapping community detection.
Observe that DER internally operates on measures $\mu_{P_s}$ rather than subsets of
the vertex set. Recall that $\mu_{P_s}(i)$ is the probability that a random walk started
from $P_s$ will hit node $i$. We can therefore consider each $i$ to be a member of those
communities from which the probability to hit it is “high enough”. To define
this formally, we first note that for any partition $P$, the following decomposition
holds:

$\pi = \sum_{s=1}^{k} \pi(P_s) \mu_{P_s}$.    (5)


This follows from the invariance of $\pi$ under the random walk. Now, given the
output of DER, that is, the sets $P_s$ and the measures $\mu_{P_s}$, set

$m_i(s) = \frac{\mu_{P_s}(i)\,\pi(P_s)}{\sum_{t=1}^{k} \mu_{P_t}(i)\,\pi(P_t)} = \frac{\mu_{P_s}(i)\,\pi(P_s)}{\pi(i)}$,    (6)

where we used (5) in the second equality. Then $m_i(s)$ is the probability that
the walk started at $P_s$, given that it finished in $i$. For each $i \in V$, set
$s_i = \arg\max_l m_i(l)$ to be the most likely community given $i$. Then define the
overlapping communities $C_1, \ldots, C_k$ via


$C_t = \{ i \in V \mid m_i(t) \ge \frac{1}{2} \cdot m_i(s_i) \}$.    (7)
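
The construction of overlapping communities from the posteriors $m_i(s)$ can be sketched as follows. The function name `overlapping_communities`, the input encoding, and the parameter `ratio` (the relative threshold against the best community, with 0.5 as the default assumed in this sketch) are illustrative assumptions; every node is assumed to have positive stationary mass.

```python
def overlapping_communities(mus, pi_blocks, ratio=0.5):
    """Assign each node to every community whose posterior is large enough.

    mus[s][i]    : probability that a walk started from P_s hits node i
    pi_blocks[s] : stationary mass pi(P_s) of the s-th set
    ratio        : relative threshold against the most likely community
                   (the default 0.5 is an assumption of this sketch)
    """
    k, n = len(mus), len(mus[0])
    communities = [set() for _ in range(k)]
    for i in range(n):
        weights = [mus[s][i] * pi_blocks[s] for s in range(k)]
        total = sum(weights)  # equals pi(i) by the decomposition of pi
        m = [w / total for w in weights]
        best = max(m)
        for t in range(k):
            if m[t] >= ratio * best:
                communities[t].add(i)
    return communities

# Hypothetical measures for k = 2 sets on four nodes (not from the paper).
mus = [[0.5, 0.3, 0.2, 0.0],
       [0.0, 0.2, 0.3, 0.5]]
communities = overlapping_communities(mus, pi_blocks=[0.5, 0.5])
```

In this toy input the two middle nodes are reachable from both sets with comparable probability, so they are placed in both communities.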


The paper [10] introduces a new algorithm for overlapping community
detection and also contains an evaluation of that algorithm as well as of several
other algorithms on a set of overlapping LFR benchmarks. The overlapping
communities LFR model was defined in [5]. In Table 1 we present the ENMI
results of DER runs on the $N = 10000$ graphs with the same parameters as in [10],
and also show the values obtained on these benchmarks in [10] (Figure S4 in
[10]), for four other algorithms. The DER algorithm was run with $L = 2$, and
$k$ was set to the true number of communities. Each number is an average over
ENMIs on 10 instances of graphs with a given set of parameters (as in [10]).
The standard deviation around this average for DER was less than 0.02 in all
cases. Variances for other algorithms are provided in [10].

For $\mu \ge 0.6$ all algorithms yield ENMI of less than 0.3. As we see in Table 1,
DER performs better than all other algorithms in all the cases. We believe this
indicates that DER together with equation (7) is a good choice for overlapping
community detection in situations where the overlap between any two
communities is sparse, as is the case in the LFR models considered above.
Further discussion is provided in the Supplementary Material, Section D.

We conclude this section by noting that while in the non-overlapping case
the models generated with $\mu = 0$ result in trivial community detection problems,
because in these cases communities are simply the connected components of the
graph, this is no longer true in the overlapping case. As a point of reference, the
well known Clique Percolation method was also evaluated in [10], in the $\mu = 0$
case. The average ENMI for this algorithm was 0.2 (Table S3 in [10]).

5 Analytic bounds 

In this section we restrict our attention to the case $L = 1$ of the DER algorithm.
Recall that the $p,q$-SBM model was defined in Section 2. We shall consider the
model with $k = 2$ and such that $|P_1| = |P_2|$. We assume that the initial
partition for DER, denoted $C_1, C_2$ in what follows, is chosen as in step 3 of
DER (Algorithm 1): a random partition of $V$ into two equal sized subsets.

In this setting we have the following: 

Theorem 5.1. For every $\epsilon > 0$ there exist $C > 0$ and $c > 0$ such that if

$p \ge C \cdot N^{-1/2+\epsilon}$    (8)

and

$p - q \ge c \sqrt{p N^{-1/2+\epsilon} \log N}$    (9)

then DER recovers the partition $P_1, P_2$ after one iteration, with probability $\phi(N)$
such that $\phi(N) \to 1$ when $N \to \infty$.

Note that the probability in the conclusion of the theorem refers to a joint 
probability of a draw from the SBM and of an independent draw from the 
random initialization. 

The proof of the theorem has essentially three steps. First, we observe
that the random initialization $C_1, C_2$ is necessarily somewhat biased, in the
sense that $C_1$ and $C_2$ never divide $P_1$ exactly into two halves. Specifically,
$\big| |C_1 \cap P_1| - |C_2 \cap P_1| \big| \ge N^{1/2-\epsilon}$ with high probability. Assume that $C_1$ has
the bigger half, $|C_1 \cap P_1| > |C_2 \cap P_1|$. In the second step, by an appropriate
linearization argument we show that for a node $i \in P_1$, deciding whether
$D(w_i, \mu_{C_1}) > D(w_i, \mu_{C_2})$ or vice versa amounts to counting paths of length two
between $i$ and $C_1 \cap P_1$. In the third step we estimate the number of these
length two paths in the model. The fact that $|C_1 \cap P_1| \ge |C_2 \cap P_1| + N^{1/2-\epsilon}$
will imply more paths to $C_1 \cap P_1$ from $i \in P_1$ and we will conclude that
$D(w_i, \mu_{C_1}) > D(w_i, \mu_{C_2})$ for all $i \in P_1$ and $D(w_i, \mu_{C_2}) > D(w_i, \mu_{C_1})$ for all
$i \in P_2$. The full proof is provided in the supplementary material.

We note that the use of paths of length two is essential for the argument to
work. A similar argument with paths of length one (edges) will not work (unless
$p$ is of the order of a constant). However, we also note that paths of length
two are never explicitly computed, as this would require squaring the adjacency
matrix. Instead, this is achieved by considering paths of length one from the
target set $C_1$ (via $\mu_{C_1}$) and paths of length one from the nodes (via $w_i$).
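
The remark about not squaring the adjacency matrix can be illustrated with a small sketch: the number of length-two paths from each node into a target set is obtained from two passes over the adjacency lists. The function name `two_path_counts` and the adjacency-dictionary encoding are assumptions made for this illustration.

```python
def two_path_counts(adj, target):
    """Count paths of length two from every node into a target set.

    adj    : dict mapping each node to the set of its neighbours
    target : set of nodes
    Returns a dict i -> number of paths i - j - t with t in target.
    """
    # First pass: number of edges from each node j into the target set.
    edges_into_target = {j: len(adj[j] & target) for j in adj}
    # Second pass: sum these counts over the neighbours j of each node i.
    return {i: sum(edges_into_target[j] for j in adj[i]) for i in adj}

# Toy path graph 0 - 1 - 2 (not from the paper), target set {0}.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
counts = two_path_counts(adj, target={0})
```

Both passes only touch edges incident to each node, so no product of adjacency matrices is ever formed.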
A Restarts and repeats

As with any k-means algorithm, DER’s results depend somewhat on its random
initializations, and can be improved by multiple runs on the same instance with
different initializations. We refer to this as restarts of the algorithm. We have
observed empirically the following behaviour of DER: Suppose a graph $G$ has a
ground truth partition $P_1, \ldots, P_k$. Then the output of a typical restart of DER
will be a partition $C_1, \ldots, C_k$ with the property that for each $C_i$, $i \le k$, either
there is $j \le k$ such that $C_i = P_j$, or there are $j_1, j_2$ such that $C_i = P_{j_1} \cup P_{j_2}$,
or there are $j$ and $l$ such that $C_i \cup C_l = P_j$. In other words, DER tends to
either find the precise cluster, or to glue together two original clusters, or split
an original cluster into two parts. Usually most of the clusters will be found
precisely, and there will be some small number of (usually small) clusters that
are glued or split. Which clusters will be glued or split depends on
the random initialization. A simple way to deal with this is to use the following
“repeats” strategy: Choose a number of repeats, $R$ (say, $R = 5$) and run DER
$R$ times. Construct the node co-occurrence matrix:

$R_{ij}$ = number of runs such that $i$ and $j$ appear in the same cluster,    (10)

for all $i, j \in V$.

The matrix R can now be regarded as an adjacency matrix of a weighted 
graph and can be clustered itself. However, R will often have very clear clusters, 
which can be found using the following trivial threshold algorithm: Define T = 
\R/2\. Initialize a set $U = V$. Choose an arbitrary $i \in U$ and define a cluster
$C$ by

$C = \{ j \in U \mid R_{ij} > T \}$.

Then output cluster C, set U = U \ C, choose a new i G U and repeat until U 
is empty. 
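
The threshold procedure can be sketched as follows. The function name `cooccurrence_clusters` is hypothetical; the threshold is passed in explicitly, so the rounding of $R/2$ is left to the caller, and the strict comparison as well as the rule that the seed node always joins its own cluster are assumptions of this sketch.

```python
def cooccurrence_clusters(runs, threshold):
    """Aggregate several partitions of the same node set by thresholding.

    runs      : list of label lists, one per run of the clustering algorithm
    threshold : nodes i, j are grouped when they share a cluster in more
                than `threshold` runs (strict comparison assumed here)
    """
    n = len(runs[0])

    def together(i, j):
        # Entry (i, j) of the co-occurrence matrix, computed on demand.
        return sum(1 for labels in runs if labels[i] == labels[j])

    unassigned = list(range(n))
    clusters = []
    while unassigned:
        i = unassigned[0]
        # The seed i always belongs to its own cluster, so the loop terminates.
        cluster = [j for j in unassigned
                   if j == i or together(i, j) > threshold]
        clusters.append(cluster)
        unassigned = [j for j in unassigned if j not in cluster]
    return clusters

# Three hypothetical runs on four nodes (not from the paper).
runs = [[0, 0, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0]]
clusters = cooccurrence_clusters(runs, threshold=1)
```

In the toy input the second run glues node 2 to the first cluster, but the other two runs outvote it.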

While on the benchmarks a single run of DER with a single restart usually 
has quite high precision, repeats are a more effective way to deal with glueing and 
splitting than the restarts. It is of course also possible to use more sophisticated 
but slower algorithms instead of the threshold one to cluster the co-occurrence
matrix R. 


B Proofs 

B.1 Lemma 3.1

Proof of Lemma 3.1: The claim is obvious for step (2) of the algorithm. For
step (1) the claim is implied by the following standard fact: Let $\nu_1, \nu_2, \ldots, \nu_l$
be any finite collection of measures. Set $\bar{\nu} = \frac{1}{l} \sum_{i=1}^{l} \nu_i$. Then for any measure $\kappa$,

$\sum_{i=1}^{l} D(\nu_i, \kappa) \le \sum_{i=1}^{l} D(\nu_i, \bar{\nu})$.    (11)

Indeed, by rearranging terms in (11), we get

$\sum_{j \in V} \left( \sum_{i=1}^{l} \nu_i(j) \right) \left( \log \bar{\nu}(j) - \log \kappa(j) \right) = l \cdot \sum_{j \in V} \bar{\nu}(j) \log \frac{\bar{\nu}(j)}{\kappa(j)} \ge 0$,

which is the non-negativity of the Kullback-Leibler divergence [5S], with equality
iff $\kappa = \bar{\nu}$. □


B.2 Main result 

We now prove Theorem 5.1, which we restate here for convenience. 

Theorem B.1. For every $\epsilon > 0$ there exist $C > 0$ and $c > 0$ such that if

$p \ge C \cdot N^{-1/2+\epsilon}$    (12)

and

$p - q \ge c \sqrt{p N^{-1/2+\epsilon} \log N}$    (13)

then DER recovers the partition $P_1, P_2$ after one iteration, with probability $\phi(N)$
such that $\phi(N) \to 1$ when $N \to \infty$.

Recall that a general plan of the proof was discussed in Section 5. We proceed 
to implement that plan. We start with stating some preliminaries. First, we 
state a version of Chernoff’s bound for binomial variables. 


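Theorem B.2 (Theorem 2.1 in [29]). Let $X \sim \mathrm{Bin}(n,p)$ be a binomial variable
and set $\lambda = np$. Then for all $t \ge 0$,

$$\mathbb{P}(X \ge \mathbb{E}X + t) \le \exp\left(-\frac{t^2}{2(\lambda + t/3)}\right) \tag{14}$$

$$\mathbb{P}(X \le \mathbb{E}X - t) \le \exp\left(-\frac{t^2}{2\lambda}\right) \tag{15}$$

In general, given a binomial $X \sim \mathrm{Bin}(n,p)$, we will often refer to $\lambda = np$ as
$X$'s lambda.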
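As a sanity check, the two tail bounds can be compared with exact binomial tails for a small example. The sketch below assumes the standard form of the bounds, $\exp(-t^2/(2(\lambda + t/3)))$ for the upper tail and $\exp(-t^2/(2\lambda))$ for the lower tail; the parameter values and helper names are illustrative only.

```python
import math


def binom_pmf(n, p, k):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(n, p, a):
    """P(X >= a)."""
    return sum(binom_pmf(n, p, k) for k in range(max(0, math.ceil(a)), n + 1))


def lower_tail(n, p, a):
    """P(X <= a)."""
    return sum(binom_pmf(n, p, k) for k in range(0, min(n, math.floor(a)) + 1))


n, p = 200, 0.1
lam = n * p
rows = []
for t in (1, 5, 10, 15):
    upper_bound = math.exp(-t**2 / (2 * (lam + t / 3)))   # right-hand side of (14)
    lower_bound = math.exp(-t**2 / (2 * lam))             # right-hand side of (15)
    rows.append((t, upper_tail(n, p, lam + t), upper_bound,
                 lower_tail(n, p, lam - t), lower_bound))

for t, up, up_b, low, low_b in rows:
    print(f"t={t:2d}  upper tail {up:.2e} <= {up_b:.2e}   lower tail {low:.2e} <= {low_b:.2e}")
```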

The following Corollary will be useful. 

Corollary B.3 (Corollary 2.3 in [29]). Let $X \sim \mathrm{Bin}(n,p)$ be a binomial variable.
Then for all $0 < \varepsilon \le \frac{3}{2}$,

$$\mathbb{P}\left(|X - \mathbb{E}X| \ge \varepsilon \cdot \mathbb{E}X\right) \le 2\exp\left(-\frac{\varepsilon^2}{3}\mathbb{E}X\right). \tag{16}$$

We will also often use the following Corollary of Theorem B.2.

Corollary B.4. There is a constant $c > 0$ such that the following holds:

Let $X \sim \mathrm{Bin}(n,p)$ be a binomial variable such that $\lambda = np \ge 1$. Then for any
$N > 0$,

$$\mathbb{P}\left(|X - \mathbb{E}X| > 20 \cdot \sqrt{\lambda} \cdot \log N\right) \le \frac{c}{N^2}. \tag{17}$$

We now present a series of Lemmas about random graphs in the $p,q$-SBM
model and about random initializations. Throughout, $G = (V, E)$ will be assumed
to be a random graph from the $p,q$-SBM, and we denote this by $G \sim \mathcal{G}_{p,q}$.
Recall that $N = |V|$ is the size of the node set, and for a node $i \in V$ in a fixed
graph $G$, $\partial i$ is the set of neighbours of $i$, and $d_i = |\partial i|$ is the degree of $i$. Also,
for a set $S \subset V$, its full degree is $d_S = \sum_{i \in S} d_i$. For a node $i$ and a set $S \subset V$, we
denote by $d(i, S) = |\partial i \cap S|$ the number of edges between $i$ and $S$, and for two
sets $S, T \subset V$ define $d(S, T) = \sum_{i \in S} d(i, T)$ to be the number of edges between
$S$ and $T$. Finally, set $d_2(i, T) = d(\partial i, T)$ to be the number of paths of length
two that start at $i$ and end at $T$.

In addition, let $C_1, C_2$, with $|C_1| = |C_2| = N/2$, be a random partition
of $V$ into two sets, the initialization of DER. Denote $N_1 = |C_1 \cap P_1|$, and
$N_2 = N/2 - N_1 = |C_1 \cap P_2| = |C_2 \cap P_1|$. We assume without loss of generality
that $N_1 \ge N_2$, and set $\Delta N = N_1 - N_2$. The partition $C_1, C_2$ will be considered
fixed in all the lemmas that concern the random graphs.

We proceed to give bounds on the expectations and concentration intervals 
of several quantities related to our problem. 

For a fixed node $i \in V$, the degree $d_i$ is distributed as a sum of two independent
binomials,

$$d_i \sim \mathrm{Bin}(N/2 - 1, p) + \mathrm{Bin}(N/2, q), \tag{18}$$

where the first term counts the edges to the component to which $i$ belongs, and the second
counts the edges to the other component. In particular, the expected degree is

$$\mathbb{E}d_i = (N/2 - 1)p + (N/2)q. \tag{19}$$
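The expected degree formula can be compared against a simulated graph. The following is a minimal sketch, not the experimental code of the paper: the function `sample_sbm_degrees` is hypothetical, nodes `0..N/2-1` are taken to form the first component, and the parameter values are chosen only for illustration.

```python
import random


def sample_sbm_degrees(N, p, q, rng):
    """Sample a two-block p,q-SBM on N nodes and return the list of degrees."""
    half = N // 2
    deg = [0] * N
    for i in range(N):
        for j in range(i + 1, N):
            same_block = (i < half) == (j < half)
            if rng.random() < (p if same_block else q):
                deg[i] += 1
                deg[j] += 1
    return deg


N, p, q = 400, 0.3, 0.1
rng = random.Random(1)
deg = sample_sbm_degrees(N, p, q, rng)

expected = (N / 2 - 1) * p + (N / 2) * q   # the expected degree formula
mean_degree = sum(deg) / N
print(expected, mean_degree)
```

The empirical mean degree should land close to the expected value, since the average over all nodes is much more concentrated than any single degree.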

Lemma B.5 (Degree bounds). Let $G \sim \mathcal{G}_{p,q}$. There exists a constant $c_1$ such
that the following holds: Assume that

$$Np \ge 100 \log N. \tag{20}$$

Then with probability at least $1 - c_1/N$, for all $i \in V$,

$$\frac{1}{4} \cdot \frac{N}{2} p \le d_i \le 2 \cdot Np. \tag{21}$$

Proof. Fix a node $i \in V$, and let $X \sim \mathrm{Bin}(N/2 - 1, p)$ and $Y \sim \mathrm{Bin}(N/2, q)$
be two independent binomials such that $d_i \sim X + Y$. By applying (16) to $X$
with $\varepsilon = \frac{1}{2}$, we obtain that

$$\frac{1}{4} \cdot \frac{N}{2} p \le \mathbb{E}X - \frac{1}{2}\mathbb{E}X \le X \le d_i \tag{22}$$

with probability at least $1 - 2\exp\left(-\frac{1}{12}\left(\frac{N}{2}p - 1\right)\right)$. Using the assumption (20),
it follows that there is $c > 0$ such that $2\exp\left(-\frac{1}{12}\left(\frac{N}{2}p - 1\right)\right) \le c/N^2$. Using the
union bound we therefore conclude that

$$\frac{1}{4} \cdot \frac{N}{2} p \le d_i \tag{23}$$

holds for all nodes $i \in V$ with probability at least $1 - c/N$. Similarly, we use
(14) to obtain that $X \le Np$ with probability at least $1 - c/N^2$, perhaps with a
different $c$, and that $Y \le Np$ with probability at least $1 - c'/N^2$, because $q \le p$.
By the union bound it follows that $d_i = X + Y \le 2Np$ with probability at least
$1 - (c + c')/N^2$, and by the union bound again, we obtain $d_i \le 2Np$ for all $i \in V$,
with probability at least $1 - c''/N$. □

In what follows we will often encounter situations where we need to bound 
fluctuations of sums of a fixed number of not necessarily independent random 
variables, and considerations similar to those in Lemma B.5 will often be omitted.

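We now consider the degree of $C_1$, $d_{C_1}$. Note that by symmetry $\mathbb{E}d_{C_1} = \mathbb{E}d_{C_2}$,
and that the total degree of the graph satisfies $d_G = d_{C_1} + d_{C_2}$. Therefore

$$\mathbb{E}d_{C_1} = \frac{1}{2}\mathbb{E}d_G = \frac{1}{2}N\,\mathbb{E}d_i = \frac{1}{2}N\left((N/2 - 1)p + (N/2)q\right). \tag{24}$$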

The next lemma concerns the concentration of the degree of $C_1$.

Lemma B.6. Set $\lambda = N^2 p$. There exist constants $c_3, c_4$ such that with probability
at least $1 - c_3/N$,

$$|d_{C_1} - \mathbb{E}d_{C_1}| \le c_4 \log N \cdot \sqrt{\lambda}. \tag{25}$$

Proof. For $l, s \in \{1, 2\}$, set $S_{ls} = C_l \cap P_s$. Observe that $d_{C_1}$ can be written as

$$\begin{aligned} d_{C_1} = {} & 2 \cdot d(S_{11}, S_{11}) + 2 \cdot d(S_{12}, S_{12}) + 2 \cdot d(S_{11}, S_{12}) \\ & + d(S_{11}, S_{21}) + d(S_{11}, S_{22}) \\ & + d(S_{12}, S_{21}) + d(S_{12}, S_{22}). \end{aligned}$$

Note that each of the terms in the sum above is a binomial variable with lambda
that is smaller than or equal to $cN^2 p$ for some constant $c > 0$. Therefore, by applying
Corollary B.4 to each term and using the union bound, we obtain the result. □

The next Lemma provides an upper bound on $\Delta N$.

Lemma B.7. There are constants $c_1, c_2 > 0$ such that

$$\Delta N \le c_1 \sqrt{N} \log N \tag{26}$$

with probability at least $1 - c_2/N$.

Proof. For the purposes of this lemma we do not assume that $N_1 \ge N_2$. Recall
that $N_1$ is the size of the intersection of $P_1$ with a random subset of $V$ of size $N/2$,
denoted $C_1$. Hence $N_1$ has the hypergeometric distribution. Set

$$\lambda = \mathbb{E}N_1 = \frac{(N/2) \cdot (N/2)}{N} = \frac{1}{4}N. \tag{27}$$

The hypergeometric distribution satisfies concentration inequalities similar to
those satisfied by the binomials. Specifically, by Theorem 2.10 in [29], the conclusion
of Corollary B.4 (inequality (17)) holds for hypergeometric variables, with
$\lambda$ defined as in (27). The result follows by an application of that inequality. □
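The size of the imbalance of a random balanced partition is easy to simulate. The sketch below is only illustrative: nodes `0..N/2-1` play the role of $P_1$, the function name `delta_n` is hypothetical, and the constant in front of $\sqrt{N}\log N$ is simply taken to be 1.

```python
import math
import random


def delta_n(N, rng):
    """|N_1 - N_2| for a uniformly random half C_1 of the node set."""
    half = N // 2
    C1 = rng.sample(range(N), half)
    N1 = sum(1 for v in C1 if v < half)   # |C_1 ∩ P_1|, hypergeometric
    N2 = half - N1                        # |C_1 ∩ P_2|
    return abs(N1 - N2)


rng = random.Random(0)
N = 2000
samples = [delta_n(N, rng) for _ in range(200)]
bound = math.sqrt(N) * math.log(N)        # sqrt(N) * log N with constant 1 (assumption)
print(max(samples), bound)
```

In this simulation the observed imbalance is of order $\sqrt{N}$, comfortably below $\sqrt{N}\log N$.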

We now examine the quantity $d(j, C_2)$ for a node $j \in V$. The expectations
satisfy

$$\mathbb{E}d(j, C_2) = N_2 p + N_1 q \quad \text{if } j \in P_1 \tag{28}$$

$$\mathbb{E}d(j, C_2) = N_1 p + N_2 q \quad \text{if } j \in P_2. \tag{29}$$

This follows from the decomposition of $d(j, C_2)$ as a sum of two binomials.
Similar expressions hold also for $d(j, C_1)$. Note that when, for instance, $j \in P_1$,
in fact $\mathbb{E}d(j, C_2) = N_2 p + N_1 q$ if $j \in C_1 \cap P_1$, and $\mathbb{E}d(j, C_2) = (N_2 - 1)p + N_1 q$
if $j \in C_2 \cap P_1$. Since we will be interested only in orders of magnitude, we will
disregard the difference between the two cases in what follows. Throughout the
proof we denote

$$L = N_2 p + N_1 q \tag{30}$$

as a convenient shorthand for $\mathbb{E}d(j, C_2)$ (when $j \in P_1$).

The quantities in the following Lemma will be relevant in what follows:

Lemma B.8. Assume that the partition $C_1, C_2$ is such that

$$\Delta N \le c\sqrt{N}\log N. \tag{31}$$

Then there exist constants $c_1, c_2, c_3, c_4 > 0$ and $\kappa_1 > 0$ such that if $Np \ge \kappa_1$
then with probability at least $1 - \frac{c_1}{N}$ the following holds: For all $j \in V$,

$$d(j, C_2) \ge c_2 Np \tag{32}$$

$$|d(j, C_1) - d(j, C_2)| \le c_3\sqrt{Np}\log N \tag{33}$$

$$d(j, C_1)/d(j, C_2) \ge \frac{1}{2} \tag{34}$$

$$|d(j, C_2) - L| \le c_4\sqrt{Np}\log N. \tag{35}$$


Proof. We show that the statements hold for every $j \in V$ individually with
probability at least $1 - c/N^2$, from which the claim of the Lemma follows by
the union bound.

Using inequality (17), we obtain that with probability at least $1 - c_5/N^2$,

$$|d(j, C_2) - \mathbb{E}d(j, C_2)| \le c_6\sqrt{Np}\log N, \tag{36}$$

and similarly

$$|d(j, C_1) - \mathbb{E}d(j, C_1)| \le c_6\sqrt{Np}\log N, \tag{37}$$

where, in a way similar to the proof of Lemma B.5, we have used the decomposition
of $d(j, C_1)$ into two binomials and the fact that $q \le p$.

Assume that Np is large enough so that 

$$c_6\sqrt{Np}\log N \le \frac{1}{16}Np \tag{38}$$

holds. 

By using the assumption (31) and (28) or (29), we obtain

$$\mathbb{E}d(j, C_2) \ge \frac{1}{8}Np$$

for all $N \ge \kappa_2$ for some constant $\kappa_2 > 0$. Combining this with (36) and with
(38), we obtain

$$d(j, C_2) \ge \mathbb{E}d(j, C_2) - c_6\sqrt{Np}\log N \ge \left(\frac{1}{8} - \frac{1}{16}\right)Np, \tag{39}$$

thereby proving (32). Next, using (28), (29) and similar expressions for $d(j, C_1)$
we obtain that

$$|\mathbb{E}d(j, C_1) - \mathbb{E}d(j, C_2)| = \Delta N (p - q). \tag{40}$$

Using (40) with (36) and (37), it follows that

$$|d(j, C_1) - d(j, C_2)| \le c\,\Delta N p + c'\sqrt{Np}\log N \le c_3\sqrt{Np}\log N, \tag{41}$$

for appropriate constants $c, c' > 0$. This proves (33). Similarly, the claim (35)
holds for all $j \in P_1$, and for $j \in P_2$ we have

$$|d(j, C_2) - L| \le |L - \mathbb{E}d(j, C_2)| + c''\sqrt{Np}\log N \le c\,\Delta N p + c''\sqrt{Np}\log N \le c_4\sqrt{Np}\log N.$$

Thus (35) holds for all $j \in V$. Finally, to show (34), write

$$\frac{d(j, C_1)}{d(j, C_2)} = 1 + \frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)}. \tag{42}$$

Then (34) holds if $\left|\frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)}\right| \le \frac{1}{2}$ holds, which in turn holds by (32) and
(33), for $N$ and $Np$ larger than some fixed constant. □


We now provide some estimates on the number of length two paths (which
we also refer to as 2-paths in what follows).

Lemma B.9. For a node $j \in P_1$,

$$\mathbb{E}d_2(j, C_1) = \frac{1}{2}N\left(N_1 p^2 + 2pqN_2 + N_1 q^2\right) \tag{43}$$

$$\mathbb{E}d_2(j, C_2) = \frac{1}{2}N\left(N_2 p^2 + 2pqN_1 + N_2 q^2\right) \tag{44}$$






Proof. For $l, s \in \{1, 2\}$, set $S_{ls} = C_l \cap P_s$. There are four types of 2-paths from
$j$ to $C_1$. Consider first those that land in $P_1$ at the first step, and then land at $S_{11}$. We denote
paths of this type by $P_1S_{11}$. There exist $\frac{1}{2}N \cdot N_1$ such possible paths, and each
one exists in the $\mathcal{G}_{p,q}$ model with probability $p^2$. For some concrete path of type
$P_1S_{11}$, say $\pi = j, u, v$, with $u \in P_1$ and $v \in S_{11}$, let $E_\pi$ be the event that this
path exists in the graph. The number of such paths is then $\sum_{\pi \in P_1S_{11}} \mathbb{1}_{E_\pi}$, and
the expected number of such paths is therefore $\frac{1}{2}NN_1p^2$. The other path types
are $P_1S_{12}$, $P_2S_{11}$, $P_2S_{12}$, with expected numbers of paths $\frac{1}{2}NN_2pq$, $\frac{1}{2}NN_1q^2$ and
$\frac{1}{2}NN_2pq$ respectively. Hence (43) holds. Similar considerations yield (44). □
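The two expectations in Lemma B.9 can be evaluated directly, which also makes their difference explicit: the cross terms cancel and the gap equals $\frac{1}{2}N(N_1 - N_2)(p-q)^2$. The sketch below uses a hypothetical function name and illustrative parameter values.

```python
import math


def expected_two_paths(N, N1, N2, p, q):
    """Expected numbers of 2-paths from a node of P_1 into C_1 and into C_2."""
    e1 = 0.5 * N * (N1 * p**2 + 2 * p * q * N2 + N1 * q**2)
    e2 = 0.5 * N * (N2 * p**2 + 2 * p * q * N1 + N2 * q**2)
    return e1, e2


N, N1, N2, p, q = 1000, 270, 230, 0.3, 0.1   # N1 + N2 = N / 2
e1, e2 = expected_two_paths(N, N1, N2, p, q)
closed_form = 0.5 * N * (N1 - N2) * (p - q) ** 2
print(e1 - e2, closed_form)
```

The gap is proportional to the initial imbalance $N_1 - N_2$, which is why even a slightly biased random initialization carries usable signal.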

Next we obtain concentration bounds on $d_2$.

Lemma B.10. There are constants $c_1, c_2 > 0$, such that with probability at least
$1 - c_1/N$ the following holds: For all $i \in P_1$,

$$|d_2(i, C_1) - \mathbb{E}d_2(i, C_1)| \le c_2 Np\log N \tag{45}$$

$$|d_2(i, C_2) - \mathbb{E}d_2(i, C_2)| \le c_2 Np\log N \tag{46}$$

Proof. Let $\partial i$ be the neighbourhood of $i$ in $G$. Set as before $S_{ls} = C_l \cap P_s$
for $l, s \in \{1, 2\}$, and set also $A_{ls} = S_{ls} \cap \partial i$. Similarly to the arguments in the
previous Lemmas, to obtain concentration bounds on $d_2(i, C_1)$ we represent it
as a sum of binomials

$$d_2(i, C_1) = \sum_{l,s \in \{1,2\}} \; \sum_{r \in \{1,2\}} d(A_{ls}, S_{1r}).$$

Then one observes that the lambda of each such binomial is of the order $Np \cdot N \cdot p$,
because the size of $A_{ls}$ is of the order of $Np$ and the size of $S_{1r}$ is of the order
of $N$. Then the conclusion follows by inequality (17). Since the sets $A_{ls}$ are
random sets, to carry out the above argument precisely we first condition on the
neighbourhood of $i$ and ensure (using (16)) that the sets $A_{ls}$ are indeed not
larger than $cNp$ for an appropriate $c > 0$. The full details are straightforward
but somewhat lengthy and are omitted. □

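We will also make use of the following inequalities:

$$\log(1 + t) \le t \quad \text{for all } t > -1 \tag{47}$$

$$t - t^2 \le \log(1 + t) \quad \text{for all } t \ge -\frac{1}{2} \tag{48}$$

$$\left|\log\frac{t}{s}\right| \le \frac{|t - s|}{\min\{t, s\}} \quad \text{for all } t, s > 0 \tag{49}$$

$$\frac{s}{t + \theta} \ge \frac{s}{t} - \frac{|\theta|}{|t + \theta|} \cdot \frac{|s|}{|t|} \quad \text{for all } t, s, \theta \tag{50}$$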
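These elementary inequalities can be spot-checked on random inputs. The sketch below assumes the forms $\log(1+t) \le t$, $t - t^2 \le \log(1+t)$ for $t \ge -1/2$, $|\log(t/s)| \le |t-s|/\min\{t,s\}$, and $s/(t+\theta) \ge s/t - \frac{|\theta|}{|t+\theta|}\cdot\frac{|s|}{|t|}$; the function name, sampling ranges and tolerance are arbitrary choices.

```python
import math
import random


def count_violations(samples=10000, seed=0, tol=1e-9):
    """Count violations of inequalities (47)-(50) on random inputs."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(samples):
        t = rng.uniform(-0.5, 5.0)
        if not math.log1p(t) <= t + tol:                      # (47)
            bad += 1
        if not t - t * t <= math.log1p(t) + tol:              # (48), needs t >= -1/2
            bad += 1
        a, b = rng.uniform(0.1, 10.0), rng.uniform(0.1, 10.0)
        if not abs(math.log(a / b)) <= abs(a - b) / min(a, b) + tol:   # (49)
            bad += 1
        s = rng.uniform(-10.0, 10.0)
        base = rng.uniform(1.0, 10.0)                         # plays the role of t in (50)
        theta = rng.uniform(-0.9, 10.0)                       # keeps base + theta > 0
        lhs = s / (base + theta)
        rhs = s / base - abs(theta) / abs(base + theta) * abs(s) / abs(base)
        if not lhs >= rhs - tol:                              # (50)
            bad += 1
    return bad


violations = count_violations()
print(violations)
```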


Proof of Theorem B.1. For $x \in V$, denote by $\partial x$ the set of neighbours of $x$ in
$G$. As indicated earlier, we shall use the fact that $C_1$ is slightly biased towards
either $P_1$ or $P_2$. Specifically, fix $\delta$ with $0 < \delta < \varepsilon$ and assume throughout the proof,
without loss of generality, that $N_1 \ge N_2$. Then the following holds with high
probability:

$$\Delta N = N_1 - N_2 \ge N^{\frac{1}{2}-\delta}. \tag{51}$$

Indeed, note that $N_1$, as a function of the random partition, is hypergeometrically
distributed with mean $N/4$ and standard deviation of order $N^{\frac{1}{2}}$. Hence,
by the central limit theorem for the hypergeometric distribution (see [30]; m), 


$$\mathbb{P}\left(\left|N_1 - \frac{1}{4}N\right| \ge N^{\frac{1}{2}-\delta}\right) \to 1 \tag{52}$$

with $N \to \infty$. Statement (52) guarantees a deviation from the mean, and in
particular that (51) holds with high probability.

To prove the Theorem we now establish the following claim:

Claim B.11. Fix a partition $C_1, C_2$ of $V$, satisfying eq. (51) and (26). Under
assumptions (12) and (13), with probability at least $1 - c/N$ the graph $G$ satisfies: For
all $i \in P_1$,

$$D(w_i, \mu_{C_1}) > D(w_i, \mu_{C_2}). \tag{53}$$

Note that the assumptions of the Claim depend only on the randomness of the
partitions and are satisfied with high probability. Indeed, (51) holds as discussed
above and (26) follows from Lemma B.7.

Once we prove the claim, by symmetry we will also have for all $i \in P_2$ a
reverse inequality in (53), and together with (51) this will prove the theorem.
We proceed to prove the claim.

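Observe that by definition we have $\mu_{C_l}(i) = \frac{d(i, C_l)}{d_{C_l}}$ for every $i \in V$.

Therefore we can rewrite (53) as:

\begin{align}
& \sum_{j \in \partial i} \log\frac{d(j, C_1)}{d(j, C_2)} \; + \tag{54} \\
& + \; d_i \log\frac{d_{C_2}}{d_{C_1}} \tag{55} \\
& > 0 \tag{56}
\end{align}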
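The passage from (53) to (54)-(56) can be spelled out as follows. This is a sketch under assumptions that are not restated in this appendix: that $D(\nu, \kappa) = \sum_j \nu(j)\log\kappa(j)$ (the form implied by the proof of Lemma 3.1), that $w_i$ is the one-step random walk measure from $i$, i.e. uniform on the neighbours of $i$, and that $\mu_{C_l}(j) = d(j, C_l)/d_{C_l}$. Here $\partial i$ denotes the set of neighbours of $i$.

```latex
\begin{align*}
D(w_i, \mu_{C_1}) - D(w_i, \mu_{C_2})
  &= \sum_{j \in \partial i} \frac{1}{d_i}
     \left( \log \mu_{C_1}(j) - \log \mu_{C_2}(j) \right) \\
  &= \frac{1}{d_i} \sum_{j \in \partial i}
     \left( \log \frac{d(j, C_1)}{d_{C_1}} - \log \frac{d(j, C_2)}{d_{C_2}} \right) \\
  &= \frac{1}{d_i} \left( \sum_{j \in \partial i} \log \frac{d(j, C_1)}{d(j, C_2)}
     + d_i \log \frac{d_{C_2}}{d_{C_1}} \right).
\end{align*}
```

Since $d_i > 0$, the left-hand side is positive exactly when the sum of the terms (54) and (55) is positive.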


We now bound the term (55). Using (49) we obtain

$$\left|\log\frac{d_{C_2}}{d_{C_1}}\right| \le \frac{|d_{C_2} - d_{C_1}|}{\min\{d_{C_2}, d_{C_1}\}}. \tag{57}$$

Using (24) and (25) we obtain that

$$\min\{d_{C_2}, d_{C_1}\} \ge cN^2 p, \tag{58}$$

and that

$$|d_{C_2} - d_{C_1}| \le cN\log N\sqrt{p}. \tag{59}$$

In addition, recall that by Lemma B.5, $d_i \le cNp$. Therefore we obtain that

$$d_i\left|\log\frac{d_{C_2}}{d_{C_1}}\right| \le cNp \cdot \frac{c' N\log N\sqrt{p}}{c'' N^2 p} \le c'''\log N\sqrt{p} \le c'''\log N \tag{60}$$

for some constant $c''' > 0$.

We now examine the term (54). Using (48), write

$$\log\frac{d(j, C_1)}{d(j, C_2)} \ge \frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)} - \left(\frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)}\right)^2. \tag{61}$$

Note that by (34), $\frac{d(j, C_1)}{d(j, C_2)} - 1 \ge -\frac{1}{2}$ and therefore (48) applies. We now replace the
denominator in the first term of the right hand side of (61) by a quantity independent
of $j$, namely by $L$ as defined in (30). Using (50) with $s = d(j, C_1) - d(j, C_2)$,
$t = L$ and $\theta = d(j, C_2) - L$, write

$$\frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)} \ge \frac{d(j, C_1) - d(j, C_2)}{L} - \frac{|d(j, C_2) - L|}{d(j, C_2)} \cdot \frac{|d(j, C_1) - d(j, C_2)|}{L}. \tag{62}$$

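To summarize, we have obtained that

\begin{align}
\sum_{j \in \partial i} \log\frac{d(j, C_1)}{d(j, C_2)} \ge {} \tag{63} \\
\sum_{j \in \partial i} \frac{d(j, C_1) - d(j, C_2)}{L} \tag{64} \\
- \sum_{j \in \partial i} \frac{|d(j, C_2) - L|}{d(j, C_2)} \cdot \frac{|d(j, C_1) - d(j, C_2)|}{L} \tag{65} \\
- \sum_{j \in \partial i} \left(\frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)}\right)^2. \tag{66}
\end{align}

Note that the term (64) satisfies

$$\sum_{j \in \partial i} \frac{d(j, C_1) - d(j, C_2)}{L} = \frac{d_2(i, C_1) - d_2(i, C_2)}{L}. \tag{67}$$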


This term counts the number of 2-paths and is the heart of the proof. Before
analysing it, we bound the other two terms in the inequality in (63). Plugging
in the estimates from Lemma B.8 we obtain for (65) that

$$\frac{|d(j, C_2) - L|}{d(j, C_2)} \cdot \frac{|d(j, C_1) - d(j, C_2)|}{L} \le c \cdot \frac{\sqrt{Np}\log N}{Np} \cdot \frac{\sqrt{Np}\log N}{Np}. \tag{68}$$

Using the degree estimate from Lemma B.5, $d_i \le cNp$, we thus get

$$\sum_{j \in \partial i} \frac{|d(j, C_2) - L|}{d(j, C_2)} \cdot \frac{|d(j, C_1) - d(j, C_2)|}{L} \le c(\log N)^2 \tag{69}$$

for an appropriate $c > 0$. Similarly, for the term (66) we have

$$\sum_{j \in \partial i} \left(\frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)}\right)^2 \le c \cdot d_i \cdot \frac{Np\log^2 N}{N^2p^2} \le c \cdot \log^2 N, \tag{70}$$


with some (perhaps different) $c > 0$.

We now proceed to obtain a lower bound on (67). The crucial property of
length two path counts, $d_2(i, C_1)$ and $d_2(i, C_2)$, that enables such a bound is
that the difference between the expectations of these quantities is of larger order
of magnitude than their fluctuations.

Indeed, by Lemma B.10, with probability at least $1 - c/N$ we have that

$$d_2(i, C_1) - d_2(i, C_2) \ge \mathbb{E}d_2(i, C_1) - \mathbb{E}d_2(i, C_2) - 2cNp\log N \tag{71}$$

for all $i \in P_1$. In addition, by Lemma B.9,

$$\mathbb{E}d_2(i, C_1) - \mathbb{E}d_2(i, C_2) = \frac{1}{2}N\Delta N(p - q)^2 \ge \frac{1}{2}N^{\frac{3}{2}-\delta}(p - q)^2, \tag{72}$$

where we have used (51) in the last inequality.
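To get a feeling for the orders of magnitude, one can compare the lower bound on the gap in expectations, $\frac{1}{2}N^{3/2-\delta}(p-q)^2$, with the fluctuation scale $Np\log N$ of Lemma B.10. The sketch below is only illustrative: it assumes $p - q$ sits exactly at the boundary $c\sqrt{pN^{-1/2+\varepsilon}\log N}$, it picks $\delta = \varepsilon/2$ as one admissible value, and all numeric constants and the function name are arbitrary assumptions.

```python
import math


def signal_vs_fluctuation(N, p, eps, delta, c):
    """Return (p - q, ratio of expectation gap to fluctuation scale)."""
    gap = c * math.sqrt(p * N ** (-0.5 + eps) * math.log(N))   # p - q at the assumed boundary
    signal = 0.5 * N ** (1.5 - delta) * gap**2                 # lower bound on the gap in expectations
    fluctuation = N * p * math.log(N)                          # order of the deviations of d_2
    return gap, signal / fluctuation


eps, delta, c, p = 0.3, 0.15, 0.5, 0.5    # illustrative values only
ratios = []
for N in (10**4, 10**6, 10**8):
    gap, ratio = signal_vs_fluctuation(N, p, eps, delta, c)
    assert gap <= p                        # so that q = p - gap is a valid probability
    ratios.append(ratio)
    print(N, round(gap, 3), round(ratio, 3))
```

Under these assumptions the ratio simplifies to $\frac{1}{2}c^2 N^{\varepsilon - \delta}$, so it grows polynomially in $N$, which is the sense in which the signal dominates the fluctuations.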

Incorporating the inequalities (60), (69), (70), we obtain that $D(w_i, \mu_{C_1}) >
D(w_i, \mu_{C_2})$ holds if the following inequality holds:

$$\frac{\frac{1}{2}N^{\frac{3}{2}-\delta}(p - q)^2 - 2cNp\log N}{L} - c\log^2 N > 0. \tag{73}$$

To prove the theorem, it remains to choose $p$ and $q$ such that (73) is satisfied.
Such $p, q$ are given by the assumptions (12), (13). Indeed, recall that $L$ satisfies
$L \le cNp$ for an appropriate $c > 0$ and hence under assumptions (12), (13) we
have

$$\frac{\frac{1}{2}N^{\frac{3}{2}-\delta}(p - q)^2 - 2cNp\log N}{L\log^2 N} \to \infty \tag{74}$$

with $N \to \infty$, hence yielding (73). □


C LFR benchmarks 

In this section we specify the full parameters used for the experiments in the 
paper. 

The LFR model is generated from the following parameters: The graph size
$N$, the mixing parameter $\mu$, community size lower and upper bounds $c_{\min}, c_{\max}$,
average degree $d$, maximal degree $d_{\max}$, and the power law exponents for the
degree and community size distributions, which are in all cases set to their
default values of $-2$ and $-1$ respectively. In addition, in the overlapping case,
the parameter $n$ specifies the number of nodes that will participate in multiple
communities, and the parameter $m$ specifies the number of communities in which
each such node will participate.

The LFR models were generated using the software available at [35].

For the non-overlapping LFR benchmarks we have used $d = 20$ and $d_{\max} = 50$,
with the rest of the parameters as specified in Section 4.2. This corresponds
precisely to the experiments in [J]. The repeats strategy is described in Section
A. For each given graph instance, DER was run with 15 repeats, using 3 restarts
in each run. The results of the repeats were clustered using the threshold
algorithm described in Section A, except in the $\mu = 0.7$ case, in which we have
used spectral clustering to cluster the co-occurrence matrix.

The LFR experiments with the spectral clustering algorithm that are shown 
in Figure 2.b were performed using the spectral clustering version in the Python
sklearn v0.14.1 package, which is an implementation of the algorithm in [5].
The spectral clustering was run with 150 restarts of its final stage Euclidean 
k-means step. We note that while the repeats strategy could be applied to the 
spectral clustering too, it did not improve the performance in this case (despite 
the fact that different runs of spectral clustering returned somewhat different 
results). The results shown in Figure 2.b are without repeats. 

For the overlapping community benchmarks we have used the following settings:
$N = 10000$, $d = 60$, $d_{\max} = 100$, $c_{\min} = 200$, $c_{\max} = 500$. The value
of $n$ was 5000 and $m$ was 4. These are the settings that were used in [10]. As
discussed in the next section, in one sense these settings can be considered a
heavy overlap, while there is a different sense in which they can be considered
sparse. In all cases we have run DER with 15 repeats and 3 restarts per run,
and we have used spectral clustering to cluster the co-occurrence matrix.

Recall that our approach to overlapping communities is to first obtain a
non-overlapping clustering and then to post-process it to obtain overlapping
communities. One can therefore ask what will happen if, in the non-overlapping
step, DER is replaced by another non-overlapping clustering algorithm. We
have tried using spectral clustering instead of DER, and applied the same
post-processing. In all cases this resulted in ENMI values close to 0.

D Overlapping LFR benchmarks 

We refer to [3] and m for the definitions of the LFR models. In this section 
we make a few brief comments regarding the structure of the overlapping LFR 
communities. 

To simplify the discussion, we restrict our attention to the particular settings 
that were used in the evaluation in Section 4.3. The settings n = 5000 and m = 4 
(see Section C) imply that there are 5000 nodes such that each node belongs to a single
community, and 5000 nodes such that each node belongs to 4 communities.
These settings may be considered as a heavy overlap (see [33]). Indeed, it 
follows theoretically from the way LFR communities are generated, and also is 
observed in actual graphs, that under these settings each community C contains 
about 20% of nodes that belong only to C, and each of the remaining 80% of 
the nodes belongs to C and to 3 other communities. 

On the other hand, for a node $i \in C$ that belongs to 3 other communities,
the 3 other communities are chosen at random among about 75 remaining communities
of the graph. This implies that for each pair of communities $C, J$, the
intersection between them is small and if a node $i \in V$ is chosen at random, the
event $i \in C$ is almost independent of the event $i \in J$.
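The figures quoted above (about 20% single-membership nodes per community, about 75 communities) can be recovered by simple counting. The sketch below is a back-of-the-envelope calculation, not the LFR generator itself; in particular, the mean community size is computed under the assumption that sizes follow a density proportional to $1/x$ on $[c_{\min}, c_{\max}]$, matching the exponent $-1$ mentioned in Section C.

```python
import math

n_total = 10000           # N, graph size
n_overlap = 5000          # n, nodes that belong to several communities
m = 4                     # communities per overlapping node
c_min, c_max = 200, 500   # community size bounds

# Total number of (node, community) memberships.
memberships = (n_total - n_overlap) * 1 + n_overlap * m
# Share of memberships that come from single-community nodes.
single_fraction = (n_total - n_overlap) / memberships

# Assumption: community sizes have density proportional to 1/x on [c_min, c_max].
mean_size = (c_max - c_min) / math.log(c_max / c_min)
n_communities = memberships / mean_size

print(memberships, single_fraction, round(mean_size, 1), round(n_communities, 1))
```

With these settings the estimate comes out at roughly 76 communities, consistent with the "about 75" figure in the text.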

The above small intersections and lack of correlations between communities
property implies that a random walk started from community $C$, after two steps,
has a chance of about 1/16 of returning to $C$ while the rest of the probability is
distributed more or less uniformly between the other communities (and is much
less than 1/16 for each community that is not $C$). In other words, the measures
$w_i$ and $w_j$ have much more chance of being correlated if $i$ and $j$ belong to some
common $C$ than otherwise. This explains why DER works well on these graphs.